
Conversation

Contributor

@wwwjn wwwjn commented Aug 16, 2025

Context

  1. Remove the kept copy of the GroupedExperts weight to free memory in StateDictAdapter.
  2. Add an illustration to the README about the issue that `split()` might cause OOM. More details in the following figure:
[Figure: screenshot (Aug 16, 2025) illustrating how `split()` can cause OOM]

Test

FSDP=8 (FSDP shard dim-0), num_experts = 256

[rank0]:In _split_weight function, weights: <class 'torch.distributed.tensor.DTensor'> torch.Size([256, 2048, 7168]) (Shard(dim=0),)
[rank0]:In _split_weight function, split_weight: <class 'torch.distributed.tensor.DTensor'> torch.Size([1, 2048, 7168]) (Replicate(),)

FSDP=8 (FSDP shard dim-1), num_experts = 256

[rank0]:In _split weights, <class 'torch.distributed.tensor.DTensor'> torch.Size([256, 2048, 7168]) (Shard(dim=1),)
[rank0]:In _split split_weight, <class 'torch.distributed.tensor.DTensor'> torch.Size([1, 2048, 7168]) (Shard(dim=1),)

@meta-cla meta-cla bot added the CLA Signed label Aug 16, 2025
@wwwjn wwwjn marked this pull request as ready for review August 16, 2025 22:44
Contributor

@tianyu-l tianyu-l left a comment


Had some comments. Need @fegin's input as well.

@@ -61,6 +61,7 @@ python scripts/checkpoint_conversion/convert_from_hf.py <hf_checkpoints_dir> <dc
Some limitations:
1. It can't be used to convert HF checkpoints on the fly using GPU DTensor, because the sharding and the quantized blocks may not be aligned well, causing silent numerical incorrectness.
2. It can't be used for weight sync to generate a state dict of bf16 because fake quantization to fp8 is applied.
3. When converting GroupedExperts weights to HF's separate per-expert weights on the fly, `torch.split()` can cause huge GPU memory usage. This is because torchtitan's GroupedExperts weight has shape `(num_experts, dim1, dim2)`, and FSDP shards it on dim-0 by default. When `to_hf()` calls `torch.split()` on dim-0, this incurs an all-gather and materializes replicated expert weights in memory.
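
To make the memory effect described in this note concrete, here is a minimal sketch (not torchtitan code; it assumes an 8-GPU `torchrun` launch and toy expert shapes in place of `(256, 2048, 7168)`): after a dim-0 `torch.split()` of a `Shard(0)` DTensor, every slice comes back `Replicate()`d, so the per-rank memory across the slices adds up to the full unsharded weight.

```python
# Sketch only: quantify the blow-up described above.
# Assumptions: 8 GPUs launched via `torchrun --nproc_per_node=8`, toy shapes.
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh = init_device_mesh("cuda", (8,))
torch.manual_seed(0)

def local_bytes(dt):
    # Size of the shard (or replica) this rank actually materializes.
    t = dt.to_local()
    return t.numel() * t.element_size()

# GroupedExperts-like weight, FSDP-style Shard(0): each rank holds 256/8 experts.
w = distribute_tensor(torch.randn(256, 32, 64), mesh, placements=[Shard(0)])
pieces = torch.split(w, 1, dim=0)  # per-expert slices, as done in to_hf()

if dist.get_rank() == 0:
    print(w.placements, pieces[0].placements)                     # (Shard(dim=0),) (Replicate(),)
    print("before split:", local_bytes(w))                        # ~1/8 of the weight
    print("after split :", sum(local_bytes(p) for p in pieces))   # full weight on every rank
```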
Contributor


I thought more about this. Even if FSDP shards on dim-1, EP will shard on dim-0 anyway. So the problem still exists. Let's discuss next week.

Contributor


Can we perform a redistribute() before split() to ensure the expert parameter is sharded on dim-1? This redistributed, dim-1 sharded parameter will be used exclusively by the split().
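
A minimal sketch of this suggestion (assumptions: 1-D 8-rank mesh, toy shapes, not the actual adapter change): the explicit `redistribute()` costs one collective, but the following dim-0 split then keeps every expert slice sharded on dim-1.

```python
# Sketch of redistribute-before-split (assumed 8-GPU torchrun launch, toy shapes).
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh = init_device_mesh("cuda", (8,))
torch.manual_seed(0)

w = distribute_tensor(torch.randn(256, 32, 64), mesh, placements=[Shard(0)])
w_dim1 = w.redistribute(mesh, placements=[Shard(1)])  # one explicit collective
pieces = torch.split(w_dim1, 1, dim=0)                # dim-0 split of a Shard(1) tensor

if dist.get_rank() == 0:
    print(pieces[0].placements)  # (Shard(dim=1),) -> slices stay sharded, no replication
```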

Contributor


With EP it's sharded on dim-0 anyway. Performing this redistribute means at least 1 comm in to_hf and at least 1 comm in from_hf.
If both EP and FSDP dim-0 sharding is used, we'll have strided sharding whose redistribute algo today may not be efficient or even correct.

Contributor

@fegin fegin Aug 18, 2025


The redistribution algorithm should be correct, but whether it is going to be efficient is debatable. I think it will be more efficient than an all-gather, since less communication is incurred, even if it is not optimal.

There should be no extra comm in from_hf, as DCP.load will handle the resharding, but this resharding can be slow for sure.
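
As a rough illustration of the DCP point (a sketch under assumptions: 8-GPU launch, toy shapes, a hypothetical checkpoint path; not torchtitan's checkpoint code): a state dict saved with one sharding can be loaded into DTensors with a different sharding, and `dcp.load` performs the resharding without an explicit redistribute in user code.

```python
# Sketch of DCP resharding (assumed 8-GPU torchrun launch, toy shapes,
# hypothetical checkpoint path "/tmp/dtensor_ckpt" on a shared filesystem).
import os
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh = init_device_mesh("cuda", (8,))
torch.manual_seed(0)

# Save the weight sharded on dim-0.
w_save = distribute_tensor(torch.randn(256, 32, 64), mesh, placements=[Shard(0)])
dcp.save({"experts.w": w_save}, checkpoint_id="/tmp/dtensor_ckpt")

# Load it back into a dim-1-sharded DTensor; dcp.load reshards into w_load in place.
w_load = distribute_tensor(torch.zeros(256, 32, 64), mesh, placements=[Shard(1)])
dcp.load({"experts.w": w_load}, checkpoint_id="/tmp/dtensor_ckpt")
```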

@@ -158,6 +158,9 @@ def to_hf(self, state_dict: dict[str, Any]) -> dict[str, Any]:
new_key = new_abstract_key.format(layer_num, expert_num)
hf_state_dict[new_key] = split_values[expert_num].squeeze()

# Remove the GroupedExperts' weight from the state_dict to free memory
del value
Contributor


I think for loading a checkpoint synchronously, this sounds fine.
But for saving, after calling to_hf we may still need the original weights for the next training steps.

Contributor Author


I see, that's a valid concern. If a user periodically saves a checkpoint in HF format, this would be an issue. I checked checkpoint.py, and it only supports last_save_in_hf in _save_last_step; we don't support saving in HF format in between.

Contributor


The adapter is independent of checkpoint.py in torchtitan. In RL weight sync, it will be called without checkpointing.

@wwwjn wwwjn changed the title [DSV3] Remove keep a copy of GroupedExperts weight, free memory in StateDictAdapter [WIP][DSV3] Remove keep a copy of GroupedExperts weight, free memory in StateDictAdapter Aug 19, 2025